Feat: Add api to get machines with leaks #570
srinivasadmurthy wants to merge 72 commits into NVIDIA:main from
Conversation
Matthias247
left a comment
I don't know about the exact use-case for this.
But I'd prefer not to add APIs for searching for alerts of specific alert types, and would instead rather extend the search filter passed to FindMachineIds to support searching by health probe IDs. That would be more universal and require no new API.
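The filter extension suggested above could look something like the following. This is a minimal sketch; `MachineSearchFilter`, its field, and `matches` are made-up illustrations, not names from the actual codebase.

```rust
// Hypothetical sketch: extend a machine search filter with health probe IDs
// instead of adding a dedicated "machines with leaks" API.
#[derive(Default)]
struct MachineSearchFilter {
    // New filter field: match machines that have an alert from any of these probes.
    health_probe_ids: Vec<String>,
}

fn matches(filter: &MachineSearchFilter, machine_probe_ids: &[&str]) -> bool {
    // An empty filter matches everything; otherwise any probe ID overlap matches.
    filter.health_probe_ids.is_empty()
        || filter
            .health_probe_ids
            .iter()
            .any(|id| machine_probe_ids.contains(&id.as_str()))
}
```

A leak query would then just pass the leak probe's ID in the generic filter, and other alert types get the same capability for free.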
@Matthias247 @kensimon Thanks for your review feedback. I have implemented the suggested changes and am requesting a re-review.
crates/api-db/migrations/20260316223033_machines_health_override_gin.sql
I'm going to quote this comment from @srinivasadmurthy to get a discussion going:
I really think if the goal here is to respond to leak alerts and shut machines off, having two different layers of polling (having to wait for the health monitor to scrape sensors from a very unreliable BMC API, then having to wait for RLA to pick up the results from the health monitor) is likely not going to be fast enough. You'd have to have an unreasonably fast polling interval to catch the alert in time to do something about it, and the cost of that is likely too much in a larger datacenter with lots of machines. It seems like it'd be better for health events to stream directly to RLA, so that the instant a health override is added to carbide, it's also forwarded to RLA which can act on it directly, bypassing the polling altogether. Is this something we've thought about? |
Yes, that's the long-term intention. In the short term, people were not sure what the health streaming/push mechanism should be, so we opted for this query model for now. It will still be very useful for other non-handling purposes (e.g., we will check whether any tray in a rack has a leak before turning on the host on a tray). But yes, the handling scenario will switch to a faster method if needed.
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description Drive by to ensure crates/systemd/src/systemd.rs is parsable on non-linux systems (macOS...). This is a no-op for linux systems. ## Type of Change - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [X] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Breaking Changes: NO. Signed-off-by: Patrice Breton <pbreton@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
… even though there are alerts. (NVIDIA#515) ## Description Fixes missing `host_machine_id` label in DPU logs and `telemetry_stats_log_records_count` metric by fetching the id through carbide API in forge-dpu-agent using the FindInterfaces request. The label is needed for `SuppressExternalAlerting` to work with the `noDpuLogsWarning` alert. The request is retried if the id isn't immediately available, using the `backon` crate to increase the retry interval to a maximum of every 5 minutes. Adds support for pending file contents to the `duppet` crate. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) https://nvbugspro.nvidia.com/bug/5668278 ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [x] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) Manual testing in local dev to verify - retry on failure - `/run/otelcol-contrib/host-machine-id` is created/updated/unchanged as expected. ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> --------- Signed-off-by: Tom Erickson <terickson@NVIDIA.COM> Co-authored-by: Ken Simon <ken@kensimon.io> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
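The retry schedule described in the change above (retry with a growing interval, capped at every 5 minutes) is implemented with the `backon` crate; as a rough sketch of that schedule alone, assuming simple doubling between attempts:

```rust
use std::time::Duration;

// Illustrative sketch of a capped-growth retry interval: the delay doubles
// after each failed attempt, up to a five-minute ceiling. The actual code
// uses the `backon` crate; this only shows the shape of the schedule.
const MAX_DELAY: Duration = Duration::from_secs(300); // 5 minutes

fn next_delay(prev: Duration) -> Duration {
    (prev * 2).min(MAX_DELAY)
}
```

Capping the interval keeps a persistently failing lookup from backing off forever while still avoiding a tight retry loop against the API.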
## Description This PR adds NVUE telemetry collection for NVLink Switches to the health service in a new collector: - NvueRest (HTTP polling) It is disabled by default and configurable (polling interval, request timeouts, and enablement of telemetry per path). Path enablement: - system_health_enabled: Poll [/nvue_v1/system/health](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/system/getSystemHealth) - cluster_apps_enabled: Poll [/nvue_v1/cluster/apps](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/cluster/getClusterApps) - sdn_partitions_enabled: Poll [/nvue_v1/sdn/partition](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/partition/getSdnPartitions) - interfaces_enabled: Poll [/nvue_v1/interface](https://docs.nvidia.com/networking-ethernet-software/nvos-api-25024300/#/interface/getInterface) ## Type of Change <!-- Check one that best describes this PR --> - [x] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Testing <!-- How was this tested? Check all that apply --> - [x] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes ~~Manual testing needed on rack (and soon to come).~~ Tested and working with `nvue_v1` running on NVOS 25 --------- Signed-off-by: Ivan Anisimov <ianisimov@nvidia.com> Co-authored-by: Ivan Anisimov <ianisimov@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description <!-- Describe what this PR does --> part two of dpf sdk refactor. this moves internal state handling largely to the dpf operator and lets the sdk trigger events when the state handler loop should act on dpu state changes. still using a custom bfb config and preloaded systemd services. adds dts as the first crd managed dpu service. holding back on using the other dpu services (present in other branch) until we can decide how those should be configured. the reasoning is not to break existing functionality from milestone 1 as those dpu services are untested. when all services are over, we should be able to remove the (backported) tera config. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [x] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> Closes FORGE-7959 ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [x] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> --------- Signed-off-by: fspitulski <fspitulski@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…ot running (NVIDIA#559) adds "update" to the stop command. This makes sure that supervisord knows about the dhcp server config before the stop is issued. This is important on newly provisioned DPUs, since the dhcp server config doesn't exist initially. Also stops fetching the timestamps when dhcp is not running, to avoid noise in the logs. ## Description <!-- Describe what this PR does --> ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [X] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [X] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…n gRPC APIs and a new DB table. (NVIDIA#490) ## Description SPIFFE JWT-SVID machine identity support: add per-tenant identity configuration and token delegation via gRPC APIs and a new DB table. --- ## 1. Database Layer (`api-db`) ### New: `crates/api-db/src/tenant_identity_config.rs` - `TenantIdentityConfig` struct for per-org identity config - `set()` – upsert identity config (issuer, audiences, TTL, signing key) - `find()` – fetch config by org - `delete()` – remove config - `set_token_delegation()` – set token exchange config (endpoint, auth method, client secret) - `delete_token_delegation()` – clear delegation config - Placeholder key generation (no real encryption yet) ### New: `crates/api-db/migrations/20260225120000_tenant_identity_config.sql` - `tenant_identity_config` table with: - **Identity:** `issuer`, `default_audience`, `allowed_audiences`, `token_ttl`, `subject_domain_prefix`, `enabled` - **Signing:** `encrypted_signing_key`, `signing_key_public`, `key_id`, `algorithm`, `master_key_id` - **Timestamps:** `created_at`, `updated_at` - **Delegation:** `token_endpoint`, `auth_method`, `encrypted_auth_method_config`, `subject_token_audience`, `token_delegation_created_at` - FK to `tenants(organization_id)` with `ON DELETE CASCADE` --- ## 2. gRPC API (`rpc`) ### New Proto Messages - `GetIdentityConfiguration` / `SetIdentityConfiguration` / `DeleteIdentityConfiguration` - `GetTokenDelegation` / `SetTokenDelegation` / `DeleteTokenDelegation` - Messages: `GetIdentityConfigRequest`, `IdentityConfigRequest`, `IdentityConfigResponse`, `TokenDelegationRequest`, `TokenDelegationResponse`, `GetTokenDelegationRequest` --- ## 3. 
API Handlers (`handlers/identity_config.rs`) ### New: `crates/api/src/handlers/identity_config.rs` (657 lines) - `get_identity_configuration` – read config by org - `set_identity_configuration` – upsert config with org validation - `delete_identity_configuration` – delete config - `get_token_delegation` – read delegation config - `set_token_delegation` – upsert delegation config - `delete_token_delegation` – clear delegation ### Helper Functions - `compute_secret_hash()` – SHA256 hash for secrets - `truncate_hash_for_display()` – truncate hash for display - `struct_to_json` / `json_to_struct` – protobuf ↔ JSON - `build_response_auth_config()` – omit secrets from responses ### Unit Tests (10) - `compute_secret_hash`, `truncate_hash_for_display` - `struct_to_json`, `json_to_struct`, `json_to_struct_roundtrip` - `build_response_auth_config_omits_client_secret`, `truncates_hash`, `passes_through_non_secret`, `non_object_returns_clone` --- ## 4. Configuration (`cfg/file.rs`) ### New: `MachineIdentityConfig` - `enabled`, `algorithm`, `token_ttl_min`, `token_ttl_max`, `token_endpoint_http_proxy` - New `[machine_identity]` section in `CarbideConfig` --- ## 5. Integration Test Support (`api-test-helper`) ### New: `crates/api-test-helper/src/identity_config.rs` - `set_identity_configuration()`, `get_identity_configuration()`, `delete_identity_configuration()` - `set_token_delegation()`, `get_token_delegation()`, `delete_token_delegation()` - Uses grpcurl for gRPC calls --- ## 6. Integration Tests (`api-integration-tests`) ### `run_identity_config_tests()` in `tests/lib.rs` - Runs after tenant creation in `test_integration` - Sets config → get → delete - Sets config again → sets token delegation → get delegation → delete delegation --- ## 7. Fixes ### `api_fixtures/mod.rs` - Added `machine_identity: MachineIdentityConfig::default()` to `get_config()` in `CarbideConfig` --- ## 8. 
Documentation ### `book/src/design/machine-identity/spiffe-svid-sdd.md` - SDD for SPIFFE JWT-SVID machine identity - Architecture, config flows, token delegation --- ## Files Changed (identity_config-related) | File | Change | |------|--------| | `crates/api-db/src/tenant_identity_config.rs` | New | | `crates/api-db/migrations/20260225120000_tenant_identity_config.sql` | New | | `crates/api/src/handlers/identity_config.rs` | New | | `crates/api/src/handlers/mod.rs` | Register handler | | `crates/api/src/handlers/machine_identity.rs` | Modified | | `crates/api/src/api.rs` | Route new RPCs | | `crates/api/src/cfg/file.rs` | Add `MachineIdentityConfig` | | `crates/api/src/tests/common/api_fixtures/mod.rs` | Add `machine_identity` | | `crates/api-test-helper/src/identity_config.rs` | New | | `crates/api-test-helper/src/lib.rs` | Export `identity_config` | | `crates/api-integration-tests/tests/lib.rs` | Add `run_identity_config_tests()` | | `crates/rpc/proto/forge.proto` | New RPCs and messages | | `book/src/design/machine-identity/spiffe-svid-sdd.md` | Updated | ## Type of Change <!-- Check one that best describes this PR --> - [x] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) NVIDIA#447 ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [x] Unit tests added/updated - [x] Integration tests added/updated - [x] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes This PR is part of larger feature implementation related to NVIDIA#261. 
<!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description Some Lenovo platforms (HS350X V3) do not support SOL over SSH, so for those we switch to IPMI if site-explorer reports `LenovoAMI` as the BMC vendor. See: NVIDIA#528 Note: These machines still report as `Lenovo` in the DMI data. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…NVIDIA#573) ## Description This PR removes the lenovo-specific `boot_first(Pxe)` from the instance `invoke_power()` call. When neither the `boot_with_custom_ipxe` nor the `run_provisioning_instructions_on_every_boot` flag is set, the user expects a normal OS reboot, not a forced network boot. Boot-order verification and correction for network-boot flows is now handled by the state machine when required, so this lenovo override is unnecessary in the default reboot path. This covers the following scenarios: 1. Disk first: the customer OS boots immediately. 2. DPU/network first: the host attempts PXE, then exits/falls through to the installed OS. 3. Custom install required, handled by the state machine: a. If network boot is already first, the state handler proceeds with the install flow. b. If disk is first, the state handler corrects boot order (making network/DPU first) and then proceeds with the install flow. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) NVIDIA#530 ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [x] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description **Problem**: Machines can become unresponsive (e.g., left in BIOS menu, OS issues, power faults) while they are in the `Ready` state, creating silent failures that go undetected by carbide until instance creation attempts fail. A metric was created for this issue but a host's health state never reflected the timeout, so allocations weren't blocked. **Fix**: Add scout heartbeat timeout health alert for machines in the `Ready` state: - Create a merge health override when `last_scout_contact_time` exceeds `scout_reporting_timeout` (default 5 minutes) - Remove the override automatically when scout heartbeat recovers in Ready - Clear the scout heartbeat timeout alert whenever the host transitions out of Ready, so stale alerts do not leak across state changes. - Continue emitting `hosts_with_scout_heartbeat_timeout` metric - By default, the alert does not block allocations and suppresses external alerting. To change this behavior, set the following in the carbide config: ``` [host_health] # Set to true to block allocations on hosts with scout heartbeat timeout (default: false) prevent_allocations_on_scout_heartbeat_timeout = true # Set to false to include these hosts in the unhealthy hosts Prometheus alert (default: true) suppress_external_alerting_on_scout_heartbeat_timeout = false ``` ## Type of Change <!-- Check one that best describes this PR --> - [x] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? 
Check all that apply --> - [x] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> --------- Signed-off-by: Krish Dandiwala <kdandiwala@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
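The core of the fix above is the timeout comparison: raise the health override once `last_scout_contact_time` is older than `scout_reporting_timeout`. A minimal sketch, assuming placeholder names (this is not the actual carbide code):

```rust
use std::time::{Duration, Instant};

// Illustrative check for the scout heartbeat timeout alert: the override
// is raised when the last scout contact is older than the reporting
// timeout (default 5 minutes), and cleared again once it recovers.
fn heartbeat_timed_out(last_contact: Instant, now: Instant, timeout: Duration) -> bool {
    now.duration_since(last_contact) > timeout
}
```

The same predicate evaluated on recovery drives the automatic removal of the override while the host is still in `Ready`.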
…VIDIA#371) ## Description This builds on the [firmware management work](NVIDIA#323) (and `ApplyFirmware`) to additionally implement `ApplyProfile` within the DPA provisioning workflow (it has been stubbed out with placeholders). The `ApplyProfile` state now handles `mlxconfig` profile management -- resetting the device's `mlxconfig` parameters to factory defaults between tenancies, and then optionally applying a named `MlxConfigProfile` if one is configured for the interface. This behavior of reset + apply updated values is the recommended guidance from NBU. High level changes include: 1. New `mlxconfig_profile` column on `dpa_interfaces` -- an optional profile name that maps into `carbide-api`'s `mlxconfig_profiles` config map. 2. Reworking the `OpCode::ApplyProfile` variant to carry an `Option<SerializableProfile>` (mirroring how `ApplyFirmware` carries a `FirmwareFlasherProfile`). 3. `carbide-api`-side config lookup + serialization in `build_apply_profile_command`. 4. `scout`-side implementation in `mlx_device::apply_profile()`. 5. Corresponding State Controller updates to handle both the reset-only and reset + profile sync workflows. In this workflow: 1. We check the interface's `mlxconfig_profile` field. 2. If `None`, we send `ApplyProfile { serialized_profile: None }`, and `scout` will reset to factory defaults (to prepare for the next tenant) and report success. 3. If set, we look it up in the `runtime_config.mlxconfig_profiles` map, serialize it via `SerializableProfile::from_profile()`, and send it down to `scout`. 4. If the profile name is set, but can't be found in config, we return an error rather than sending `None` (which would silently reset without applying any intended profile(s)). 5. `scout` always resets mlxconfig to factory defaults first, then applies the profile if one was provided, and reports back via `MlxObservation`. 
The `ApplyProfile` state handler was also broken out into its own `handle_apply_profile()` function, making it independently testable without needing the full async state controller scaffolding. I need to go back and do this in a few other pre-existing places. Existing tests updated as needed, and new tests introduced. Signed-off-by: Chet Nichols III <chetn@nvidia.com> ## Type of Change <!-- Check one that best describes this PR --> - [x] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [x] Unit tests added/updated - [x] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Chet Nichols III <chetn@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
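The lookup rules in steps 1-4 above (unset name means reset only; a set name must resolve in config or it's an error) can be sketched roughly as follows. `SerializableProfile` appears in the PR, but this function and its exact signature are illustrative, not the real carbide API:

```rust
use std::collections::HashMap;

// Stand-in for the PR's SerializableProfile; the payload here is just a string.
#[derive(Clone, PartialEq, Debug)]
struct SerializableProfile(String);

// Hypothetical sketch of the ApplyProfile config lookup:
// - None          -> send a reset-only command (Ok(None))
// - Some(found)   -> serialize and send the named profile
// - Some(missing) -> error, rather than silently resetting without the profile
fn resolve_profile(
    configured: Option<&str>,
    profiles: &HashMap<String, SerializableProfile>,
) -> Result<Option<SerializableProfile>, String> {
    match configured {
        None => Ok(None), // reset to factory defaults only
        Some(name) => profiles
            .get(name)
            .cloned()
            .map(Some)
            .ok_or_else(|| format!("mlxconfig profile '{name}' not found in config")),
    }
}
```

Returning an error for a missing named profile matches the PR's reasoning: sending `None` in that case would reset the device without applying the profile the operator intended.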
## Description Update to new name ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [X] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [X] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description <!-- Describe what this PR does --> delete dpf-sdk crate, rename dpf-sdk-beta crate to dpf-sdk, update references to dpf-beta to dpf. fix license headers to apache 2. (old) dpf-sdk crate is already unused. this is just cleanup. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [x] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> --------- Signed-off-by: fspitulski <fspitulski@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description This continues the work from NVIDIA#606, NVIDIA#602, NVIDIA#598, NVIDIA#596, NVIDIA#608, and NVIDIA#610. TLDR is we had leaked a few things from `::rpc` into the `api-db` layer, which we generally don't want to do, and now that we have `STYLE_GUIDE.md`, it was good to practice what we preach. Everything else has been refactored. This handles the last bit of it, and then kicks `carbide-rpc` out of the `carbide-api-db` crate entirely. Signed-off-by: Chet Nichols III <chetn@nvidia.com> ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Chet Nichols III <chetn@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
On numerous occasions, I have wanted to go into the admin UI to look at
DHCP lease information directly, but it's not there.
And I wanted to do it again today. And as usual, it's not there.
And I was like well, if it's not there, we should add it.
But what else should we add?
And then I thought, you know, we probably could use an IPAM section in
general, and have it be a place to look at:
- DHCP allocations (because we have `carbide-dhcp`).
- Authoritative DNS entries (because we have `carbide-dns`).
- Underlay networks/prefixes (because we manage those).
- Overlay networks/prefixes (and we manage those too).
Right now this is just taking care of the DHCP part, and I'm adding
placeholders for DNS and networks.
DHCP details include:
```
struct DhcpEntryDisplay {
    ip_address: String,
    mac_address: String,
    machine_id: String,
    hostname: String,
    created: String,
    last_dhcp: String,
    last_dhcp_rfc3339: String,
}
```
I hope people like this idea, because it's going to make me a lot
happier.
Signed-off-by: Chet Nichols III <chetn@nvidia.com>
## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)
## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->
## Breaking Changes
- [ ] This PR contains breaking changes
<!-- If checked above, describe the breaking changes and migration steps
-->
## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)
## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->
Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
… to the admin UI (NVIDIA#627) ## Description Now that we're starting to turn up rack components (`Rack`, `PowerShelf`, `Switch`) in Carbide, and since we have `ExpectedRack`, `ExpectedPowerShelf`, and `ExpectedSwitch`, it makes sense to have these available in the admin UI under the `Rack`, `Power Shelf`, and `Switch` sections, which right now just have managed ones, and not the expected details. This adds all of that, including linked information, and the status of explored/adopted components. Signed-off-by: Chet Nichols III <chetn@nvidia.com> ## Type of Change <!-- Check one that best describes this PR --> - [x] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Chet Nichols III <chetn@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description In another review, @Matthias247 pointed out that we could/should probably have a pattern where `Args` can just `.into()` (or `.try_into()?`) the underlying `Request` that they exist to populate. Most cases of this are super straightforward, so I'm doing that to some more of them (in addition to the ones I've already implemented). At the end of the day, a "command" is now something like.. ``` let req = args.try_into()?; let resp = api_client.0.call(req).await?; ..do something w/resp ``` ...and probably allows us to get even deeper into how we templatize things in the admin CLI. Signed-off-by: Chet Nichols III <chetn@nvidia.com> ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Chet Nichols III <chetn@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
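The `Args` -> `Request` conversion pattern described above is ordinary `TryFrom`. As a sketch with made-up stand-in types (`ShowMachineArgs` and `ShowMachineRequest` are not from the codebase):

```rust
// Hypothetical CLI args type and the gRPC request it exists to populate.
struct ShowMachineArgs {
    machine_id: String,
}

struct ShowMachineRequest {
    machine_id: String,
}

// With TryFrom implemented, a command body reduces to `args.try_into()?`
// followed by the API call, as the PR description shows.
impl TryFrom<ShowMachineArgs> for ShowMachineRequest {
    type Error = String;

    fn try_from(args: ShowMachineArgs) -> Result<Self, Self::Error> {
        // Validation lives in the conversion, not in every command handler.
        if args.machine_id.is_empty() {
            return Err("machine id must not be empty".into());
        }
        Ok(ShowMachineRequest { machine_id: args.machine_id })
    }
}
```

Centralizing validation in the conversion is what makes the command bodies uniform enough to templatize in the admin CLI.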
## Description This new architecture document describes how various network partitioning technologies (DPUs, IB and NVLink) are integrated into NICo. It also acts as a guideline that future integrations should follow. ## Type of Change - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) ## Breaking Changes - [ ] This PR contains breaking changes ## Testing - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [x] No testing required (docs, internal refactor, etc.) ## Additional Notes Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description Add mock hardware support for NVIDIA DGX H100 systems including: - 8x H100 80GB HBM3 GPUs with HGX chassis - 2x ConnectX-7B quad-port InfiniBand NICs - 1x ConnectX-7A dual-port storage NIC - 1x Intel E810 dual-port storage NIC - 1x Intel X550 management NIC - 1x BlueField-3 DPU (fixed count of 1) New NIC hardware modules: - nic_intel_e810: Intel E810 dual-port NIC - nic_intel_x550: Intel X550 NIC - nic_nvidia_cx7: ConnectX-7 variants (CX7A dual-port, CX7B quad-port) Additional changes: - Add AMI BMC vendor support with /SD settings path - Make manager eth_interfaces and firmware_version optional - Make NIC serial_number optional for Intel NICs without serials - Add fixed_number_of_dpu() for platforms with fixed DPU count - Add ok_no_content() helper for PATCH responses returning 204 - Add bmc_redfish_version() per hardware type ## Type of Change - [x] **Add** - New feature or capability - [x] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) ## Breaking Changes - [ ] This PR contains breaking changes ## Testing - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [x] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes Signed-off-by: Dmitry Porokh <dporokh@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description update helm/.github codeowner ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [x] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description Add handling for force-delete and link the doc in the style guide ## Type of Change - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) ## Breaking Changes - [ ] This PR contains breaking changes ## Testing - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description Currently WorkLockManager is explicitly cancelled with the toplevel CancellationToken. But handles to it would still exist in things like the state controllers, which might need to finish their current iteration before cancelling. So instead of explicitly cancelling WorkLockManager when the toplevel cancellation signal is received, let it cancel only after all handles are dropped (which they should be once all the dependents are done and drop their handles.) With this approach, the cancel signal will explicitly cancel the API listener and all the background controllers, which will drop the `Arc<Api>` handle, which will free the last handle to WorkLockManager, which will shut it down. The toplevel JoinSet still requires all tasks to be complete, so we still rely on all of this happening before the API server actually shuts down. (Currently only integration tests use the explicit shutdown signal, but adding a proper "graceful shutdown on SIGINT" handler can be done trivially as a followup.) ## Type of Change - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [X] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) ## Breaking Changes - [ ] This PR contains breaking changes ## Testing - [x] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes See discussion in NVIDIA#586 for details. Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Fix typo in DCO section Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…IDIA#288) Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description Removes the old `DpuSsh` API and adds a generic `BmcCredentials` API. For now it only supports the `UsernamePassword` credentials type, but in the future it should support `SessionTokens` as well. ## Type of Change <!-- Check one that best describes this PR --> - [x] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [x] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) NVIDIA#460 ## Breaking Changes - [x] This PR contains breaking changes Removes old DpuSSH credentials API. ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [x] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: ianisimov <ianisimov@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Continuation of NVIDIA#628. Wanted to do a bit first, and then do some more here, mainly since it's a lot to look at. TLDR is that we have a pattern where `Args` can just `.into()` (or `.try_into()?`) the underlying `Request` that they exist for. These next refactorings are also pretty straightforward, but I had punted them to a separate PR because some of them weren't AS straightforward as the original ones. At the end of the day, a "command" is now something like.. ``` let req = args.try_into()?; let resp = api_client.0.call(req).await?; ..do something w/resp ``` ..or in many cases (thanks @poroh for the callout here).. ``` api_client.0.call(args).await?; ``` ...and probably allows us to get even deeper into how we templatize things in the admin CLI. ## Description <!-- Describe what this PR does --> ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Chet Nichols III <chetn@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description This adds an introductory DNS section to the admin UI that includes: - Zones we are authoritative for. - The records we currently serve (including record information). There will be some iterative improvements here, but I want to get something out for people to work with and look at and go "oh we should do X and Y instead." Some improvements would probably be: - Pagination (either server or client side). - Filtering (either server or client side). - Better integration with the global search box (if it doesn't exist yet). - Etc. Signed-off-by: Chet Nichols III <chetn@nvidia.com> ## Type of Change <!-- Check one that best describes this PR --> - [x] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Chet Nichols III <chetn@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description
This introduces an Overlay Networks section to the IPAM section in the
admin UI.
The drilldown is:
- `/admin/ipam/overlay`
- `/admin/ipam/overlay/prefix/{prefix-id}`
- `/admin/ipam/overlay/segment/{segment-id}`
When you go to the main **Overlay Networks** page, you get a table with:
- Name
- VNI
- Prefixes
You can then click on one of the prefixes for that VNI, which brings you
to a **Prefix** view showing:
- Prefix
- Name
- Gateway
- Allocated IPs
- Segments (prefix)
You can THEN click on one of the segments in that VNI prefix, which
brings you to a **Segment** view showing:
- Segment
- Parent Prefix
- Table of allocated IPs (and the current instance it's allocated to)
This **ALSO** adds the **Underlay** section, which I was going to keep
separate, and then just decided to make it all one PR.
The drilldown is:
- `/admin/ipam/underlay`
- `/admin/ipam/underlay/segment/{segment-id}`
I was hoping to reuse the segment view, but since the data is different,
it's two separate ones right now.
When you go to the main **Underlay Networks** view, you get a table
with:
- Name
- Prefix
- Type (admin, underlay, host_inband).
- Gateway.
- Allocated IPs.
And then when you drill down to the segment, you get the parent prefix
and a table of allocated:
- IP
- MAC
- Machine
Signed-off-by: Chet Nichols III <chetn@nvidia.com>
## Type of Change
<!-- Check one that best describes this PR -->
- [x] **Add** - New feature or capability
- [ ] **Change** - Changes in existing functionality
- [ ] **Fix** - Bug fixes
- [ ] **Remove** - Removed features or deprecated functionality
- [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.)
## Related Issues (Optional)
<!-- If applicable, provide GitHub Issue. -->
## Breaking Changes
- [ ] This PR contains breaking changes
<!-- If checked above, describe the breaking changes and migration steps
-->
## Testing
<!-- How was this tested? Check all that apply -->
- [ ] Unit tests added/updated
- [ ] Integration tests added/updated
- [ ] Manual testing performed
- [ ] No testing required (docs, internal refactor, etc.)
## Additional Notes
<!-- Any additional context, deployment notes, or reviewer guidance -->
Signed-off-by: Chet Nichols III <chetn@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description I don't know how this actually snuck through -- build + tests passed? All said, it needs to go! This was the old placeholder for the IPAM admin UI section. Now that all of the sub-sections have been filled in, this isn't in use, and is causing errors for being dead code. Removing both the struct and its associated HTML file. Signed-off-by: Chet Nichols III <chetn@nvidia.com> ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [x] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Chet Nichols III <chetn@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…IA#656) Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…#654) Signed-off-by: Andrew Forgue <aforgue@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…o 90 minutes (NVIDIA#650) ## Description This state requires the following operations - power-cycle the host - CheckHostConfig - ConfigureBios & PollingBiosSetup - SetBootOrder Power-cycling the host and SetBootOrder can sometimes be time-consuming depending on machine configuration; some machines can take 20 minutes for the first SetBootOrder attempt. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [x] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description This was discussed at length internally amongst [DSX](https://nvidianews.nvidia.com/news/nvidia-releases-vera-rubin-dsx-ai-factory-reference-design-and-omniverse-dsx-digital-twin-blueprint-with-broad-industry-support) and [NCX](https://docs.nvidia.com/ncx/index.html) teams. While historically [what is now known as] NICo has always generated its own internal ID for *components*, a rack ID is kind of a grey area between a component and an identifier. You can think of a rack as a supercomputer, or something akin to a blade server, BUT, you can also think of it as a place where components are stored. With that, we decided the `RackId` should be a `String` whose SoT comes from the DCIM (Datacenter Inventory Manager) system, since that is how the BMS (Building Management System) identifies racks, and events for racks, including things like leak detection. Instead of NICo generating its own stable `RackId` and maintaining a mapping between its internal rack ID and the DCIM rack ID, it was decided the `RackId` should just come from the DCIM as part of "expected component" ingestion: `ExpectedRack`, `ExpectedMachine`, `ExpectedPowerShelf`, and `ExpectedSwitch` entities will all be enqueued with the `RackId` from the DCIM. This ultimately allows for the DCIM rack ID to be the one and only SoT, and allows for all components in a DSX AI Factory to agree without confusion from alternative IDs. So, strip away the hardware-backed `RackId` plans, and move towards a newtype over `String`, allowing the DCIM to provide whatever it wants. This change is backwards compatible: - The database already stores as text, not a `uuid`, so we don't have to worry about conversion issues. - We are moving from an "encoded" value to an open `String` value, so even pre-existing encoded `RackIds` work with the `String`-backed newtype. - The gRPC `common.RackId` is still the same. It was a `String` to begin with. - The JSON serialization of it is still the same. 
We use `#[serde(transparent)]`, so it's still just `"rack_id": "whatever-id-you-want"`. The downside is that, now that it's a `String`, it can no longer be `Copy`, so we have to `.clone()` in certain cases. I pass around by reference as much as I can, but not everywhere. Also added some tests to ensure: - The old format still works. - New strings work. - Empty strings don't work. Signed-off-by: Chet Nichols III <chetn@nvidia.com> ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [x] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [x] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Chet Nichols III <chetn@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…IA#661) ## Description machine-a-tron now supports a handful of different hardware types, so it is useful to see the hardware type of a specific machine in the TUI. ## Type of Change - [x] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) ## Breaking Changes - [ ] This PR contains breaking changes ## Testing - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [x] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes Signed-off-by: Dmitry Porokh <dporokh@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
## Description Don't need these. Noticed them when doing some other work. The longer technical explanation, for those unfamiliar, is that `.to_string()` takes `&self` and creates a new `String` from it, so we were making a `.clone()` of something that would just be taken as a reference anyway. Signed-off-by: Chet Nichols III <chetn@nvidia.com> ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [ ] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [x] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [ ] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) ## Additional Notes <!-- Any additional context, deployment notes, or reviewer guidance --> Signed-off-by: Chet Nichols III <chetn@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
…A#620) ## Description Fixes a gap where extension services marked for removal were not cleaned up from instance config even after all DPUs reported successful termination. Previously, terminated extension service cleanup was only executed in the `WaitingForExtensionServicesConfig` instance state; however, extension service config updates do not transition the machine out of Ready (only tenant state moves to `Configuring`). This change adds instance extension service config cleanup in the `Ready` state so terminated services can be cleaned up. ## Type of Change <!-- Check one that best describes this PR --> - [ ] **Add** - New feature or capability - [ ] **Change** - Changes in existing functionality - [x] **Fix** - Bug fixes - [ ] **Remove** - Removed features or deprecated functionality - [ ] **Internal** - Internal changes (refactoring, tests, docs, etc.) ## Related Issues (Optional) <!-- If applicable, provide GitHub Issue. --> ## Breaking Changes - [ ] This PR contains breaking changes <!-- If checked above, describe the breaking changes and migration steps --> ## Testing <!-- How was this tested? Check all that apply --> - [x] Unit tests added/updated - [ ] Integration tests added/updated - [ ] Manual testing performed - [ ] No testing required (docs, internal refactor, etc.) Signed-off-by: Felicity Xu <hanyux@nvidia.com> Signed-off-by: Srinivasa Murthy <srmurthy@nvidia.com>
|
Yes. I had not signed the commits, and followed GitHub's recommendations to sign them (which included a rebase). That seems to have messed up the PR. I am going to create a new PR from a new branch and discard this one. |
Description
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes
Tested by enabling the debug features `cpu2temp_alert` and `leak_alert` in crates/health/Cargo.toml.
Setting these generates the relevant health overrides; grpcurl was then used to exercise the GetHardwareLeaksReport API.